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Abstract We present the Procrustes measure, a novel measure based on Procrustes rotation 
that enables quantitative comparison of the output of manifold-based embedding algorithms 
(such as LLE (Roweis and Saul 2000) and Isomap (Tenenbaum et al 2000 )). The measure 
also serves as a natural tool when choosing dimension-reduction parameters. We also present 
two novel dimension-reduction techniques that attempt to minimize the suggested measure, 
and compare the results of these techniques to the results of existing algorithms. Finally, 
we suggest a simple iterative method that can be used to improve the output of existing 
algorithms. 
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1 Introduction 

Technological advances constantly improve our ability to collect and store large sets of data. 
The main difficulty in analyzing such high-dimensional data sets is, that the number of 
observations required to estimate functions at a set level of accuracy grows exponentially 
with the dimension. This problem, often referred to as the curse of dimensionality, has led 
to various techniques that attempt to reduce the dimension of the original data. 

Historically, the main approach to dimension reduction is the linear one. This is the ap- 
proach used by principle component analysis (PCA) and factor analysis (see Mardia et al 
|1979| for both). While these algorithms are largely successful, the assumption that a linear 
projection describes the data well is incorrect for many data sets. A more realistic assump- 
tion than that of an underlying linear structure is that the data is on, or next to, an embedded 
manifold of low dimension in the high-dimensional space. Here a manifold is defined as a 
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topological space that is locally equivalent to a Euclidean space. Locally, the manifold can 
be estimated by linear approximations based on small neighborhoods of each point. Many 
algorithms were developed to perform embedding for manifold-based data sets, including 
the algorithms suggested by Roweis and Saul (2000); Tenenbaum et al (2000); Belkin and 
|Niyogi|((2003^;|Zhang and Zha| ( |2004| l; |Donoho and Grimes| l |2004fr ; |Weinberger and Saul| 
( |2006[ ); [Dollar et al 1 2007| >. Indeed, these algorithms have been shown to succeed even where 
the assumption of linear structure does not hold. However, to date there exists no good tool 
to estimate the quality of the result of these algorithms. 

Ideally, the quality of an output embedding could be judged based on a comparison to 
the structure of the original manifold. Indeed, a measure based on the idea that the manifold 
structure is known to a good degree was recently suggested by Dollar et al ( 2007 1. However, 
in the general case, the manifold structure is not given, and is difficult to estimate accurately. 
As such ideal measures of quality cannot be used in the general case, an alternate quantitative 
measure is required. 

In this work we suggest an easily computed function that measures the quality of any 
given embedding. We believe that a faithful embedding is an embedding that preserves the 
structure of the local neighborhood of each point. Therefore the quality of an embedding is 
determined by the success of the algorithm in preserving these local structures. The function 
we present, based on the Procrustes analysis, compares each neighborhood in the high- 
dimensional space and its corresponding low-dimensional embedding. Theoretical results 
regarding the convergence of the proposed measure are presented. 

We further suggest two new algorithms for discovering the low-dimensional embedding 
of a high-dimensional data set, based on minimization of the suggested measure function. 
The first algorithm performs the embedding one neighborhood at a time. This algorithm is 
extremely fast, but may suffer from incremental distortion. The second algorithm, based on 
simulated annealing ( Kirkpat rlck et al| |1983| , performs the embedding of all local neigh- 
borhoods simultaneously. A simple iterative procedure that improves on an existing output 
is also presented. 

The paper is organized as follows. The problem of manifold learning is presented in 
Section [2] A discussion regarding the quality of embeddings in general and the suggested 
measure of quality are presented in Section [3] The embedding algorithms are presented 
in Section]?] In Section [5] we present numerical examples. All proofs are presented in the 
Appendix. 



2 Manifold-learning problem setting and definitions 

In this section we provide a formal definition of the manifold-learning dimension-reduction 
problem. 

Let ./# be a tZ-dimensional manifold embedded in R ? . Assume that a sample is taken 
from ./#. The goal of manifold-learning is to find a faithful embedding of the sample in 

The assumption that the sample is taken from a manifold is translated to the fact that 
small distances on the manifold can be approximated well by the Euclidian distance in 
W. Therefore, to find an embedding, one first needs to approximate the structure of small 
neighborhoods on the manifold using the Euclidian metric in W. Then one must find a 
unified embedding of the sample in Mf 1 that preserves the structure of local neighborhoods 
on ^£ . 

In order to adhere to this scheme, we need two more assumptions. First, we assume that 
^ is isometrically embedded in W . By definition, an isometric mapping between two man- 
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ifolds preserves the inner product on the tangent bundle at each point. Less formally, this 
means that distances and angles are preserved by the mapping. This assumption is needed 
because we are interested in an embedding that everywhere preserves the local structure of 
distances and angles between neighboring points. If this assumption does not hold, close 
points on the manifold may originate from distant points in M. d and vice versa. In this case, 
the structure of the local neighborhoods on the manifold will not reveal the structure of the 
original neighborhoods in R d . We remark here that the assumption that the embedding is 
isometric is strong but can be relaxed. One may assume instead that the embedding map- 
ping is conformal. This means that the inner products on the tangent bundle at each point 
are preserved up to a scalar c that may change continuously from point to point. Note that 
the class of isometric embeddings is included in the class of conformal embeddings. While 
our main discussion regards isometric embeddings, we will also discuss the conformal em- 
bedding problem, which is the framework of algorithms such as c-Isomap l |de Silva and| 
Tenenbaum 2003) and Conformal Embeddings (CE) ( |Sha an d Saul, 2005)1. 

The second assumption is that the sample taken from the manifold . # is dense. We need 
to prevent the situation in which the local neighborhood of a point, which is computed ac- 
cording to the Euclidian metric in R ? , includes distant geodesic points on the manifold. This 
can happen, for example, if the manifold is twisted. The result of having distant geodesic 
points in the same local neighborhood is that these distant points will be embedded close to 
each other instead of preserving the true geodesic distance between them. 

To define the problem formally, we require some definitions. 

The neighborhood of a point x, 6 ^# is a set of points X, that consists of points close 
to Xi with respect to the Euclidean metric in W . For example, neighbors can be A'-nearest 
neighbors or all the points in an e-ball around Xj, 

The minimum radius of curvature tq = ro(^#) is defined as follows: 



— = max (11 y(r) 111 
m r.t Ul II J 



r y,t 

where y varies over all unit-speed geodesies in ^# and t is in a domain of y. 

The minimum branch separation sq = sq(^) is defined as the largest positive number 
for which \\x — x\\ < sq implies d y g(x,x) < Kro, where x,xE ^£ andd^(x,x) is the geodesic 
distance between x and x (see Bernstein et al 2000. , for both definitions). 

We define the radius r(i) of neighborhood i to be 

r(i) = max U;— x-.|| 
je{i,...,mr J " 

where x- tj is the y-th out of the k(j) neighbors of x,-. Finally, we define r max to be the maximum 
over r(i) . 

We say that the sample is dense with respect to the chosen neighborhoods if r max < 
sq. Note that this condition depends on the manifold structure, the given sample, and the 
choice of neighborhoods. However, for a given compact manifold, if the distribution that 
produces the sample is supported throughout the entire manifold, this condition is valid with 
probability increasing towards 1 as the size of the sample is increased and the radius of the 
neighborhoods is decreased. 

We now state the manifold-learning problem more formally. Let D C R d be a com- 
pact set and let <j> : D — > R 9 be a smooth and invertible isometric mapping. Let ./# be the 
J-dimensional image of D in R 9 . Letxj , . . . ,x„ be a sample taken from Define neighbor- 
hoods X, for each of the points x,. Assume that the sample x\, . . . ,x n is dense with respect to 
the choice of Xi . Find y\ , . . . ,y n £ R d that approximate <j> ~ 1 (x\ ) , . . . , <j> ~ 1 (x„ ) up to rotation 
and translation. 
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3 Faithful embedding 

As discussed in Section[2] a. faithful embedding should preserve the structure of local neigh- 
borhoods on the manifold, while finding a global embedding mapping. 
In this section, we will attempt to answer the following two questions: 

1. How do we define "preservation of the local structure of a neighborhood"? 

2. How do we find a global embedding that preserves the local structure? 

We now address the first question. Under the assumption of isometry, it seems reason- 
able to demand that neighborhoods on the manifold and their corresponding embeddings be 
closely related. A neighborhood on the manifold and its embedding can be compared using 
the Procrustes statistic. The Procrustes statistic measures the distance between two configu- 
rations of points. The statistic computes the sum of squares between pairs of corresponding 
points after one of the configurations is rotated and translated to best match the other. 

In the remainder of this paper we will represent any set of k points x\ , . . . £ K 9 as a 
matrix X^, xq = \pd l ,... ,x^]; i.e., the y'-th row of the matrix X corresponds to xj. 

Let X be a A-point set in R 9 and let 7 be a &-point set in R d , where d < q. We define the 
Procrustes statistic G(X,Y) as 

k 

G{X,Y) = inf V \\xi-Ayi-bf (1) 
inf tr {(X -YA' -\b')'(X-YA' -\b')) 

{A,b:A'A=I,h^ a ^ v 



where the rotation matrix A is a columns-orthogonal q x d matrix, A' is the adjoint of A, and 
1 is a fc-dimensional vector of ones. 

The Procrustes rotation matrix A and the Procrustes translation vector b that minimize 
G(X, Y) can be computed explicitly, as follows. Let Z = X'HY where H = I — ill' is the 
centering matrix. Let ULV' be the singular-value decomposition (svd) of Z, where U is 
an orthogonal q x d matrix, L is a non-negative d x d diagonal matrix, and V is a d x d 
orthogonal matrix Mardia et al||1979) . Then, the Procrustes rotation matrix A is given by 



UV' < Sibson 1978 . The Procrustes translation vector b is given by x — Ay, where x and y 
are the sample means of X and Y, respectively. Due to the last fact, we may write G(X,Y) 
without the translation vector b as 



G(X,Y)= inf tr ((X - YA')'H(X - YA')) = inf \\H(X -YA')\\l , 

{A:A'A=I} V ' {A:A'A=l} U UF 

where || \\ F is the Frobenius norm. 

Given X, the minimum of G(X,Y) can be computed explicitly and is achieved by the 
first d principal components of X. This result is a consequence of the following lemma. 

Lemma 1 Let X = X^ X9 be a centered matrix of rank q and let d < q. Then 

. inf \\X-X\\ 2 F (2) 

{X:rank(X)=a'} 



is obtained when X equals the projection ofX on the subspace spanned by the first d prin- 
cipal components ofX. 
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For proof, see |Mardia et al| ( fl979] page 220). 

Returning to the questions posed at the beginning of this section, we will define how well 
an embedding preserves the local neighborhoods using the Procrustes statistic G(X,-,F;) of 
each neighborhood-embedding pair (X;, 7;). Therefore, a global embedding that preserves 
the local structure can be found by minimizing the sum of the Procrustes statistics of all 
neighborhood-embedding pairs. 

More formally, let X be the g-dimensional sample from the manifold and let 7 be a 
d-dimensional embedding of X. Let Xj be the neighborhood of x, (i = 1, . . . ,«) and 7, its 
embedding. Define 

R(X,Y) = -j^G(X i ,Y i ). (3) 

The function R measures the average quality of the neighborhood embeddings. Embed- 
ding Y is considered better than embedding 7 in the local-neighborhood-preserving sense 
if R(X,Y) < R(X,Y). This means that on the average, Y preserves the structure of the local 
neighborhoods better than 7. 

The function R(X,Y) is sensitive to scaling, therefore normalization should be consid- 
ered. A reasonable normalization is 

R N (X,Y) = -j^G(X i ,Y i )/\\HX i \\ 2 F . (4) 

The z'-th summand of Rn(X,Y) examines how well the rotated and translated 7, "explains" 
Xi, independent of the size of Xj. This normalization solves the problem of increased weight- 
ing for larger neighborhoods that exists in the unnormalized R(X, 7). It also allows compar- 
ison of embeddings for data sets of different sizes. Hence, this normalized version is used to 
compare the results of different outputs (see Section[5]l. 

In the remainder of this section, we will justify our choice of the Procrustes measure R 
for a quantitative comparison of embeddings. We will also present two additional, closely 
related measures. One measure is Rpca, which can ease computation when the input space 
is of high dimension. The second measure is Rq, which is a statistic designed for conformal 
mappings (see Section [21. Finally, we will discuss the relation between the measures sug- 
gested in this work to the objective functions of other algorithms, namely LTSA (Zha ng and| 
[Zha|[2004l l and SDE ( [Weinberger and Saul[[2006l l. 

We now justify the use of the Procrustes statistic G(X, , 7, ) as a measure of the quality of 
the local embedding. First, G(X,,7) estimates the relation between the entire input neigh- 
borhood and its embedding as one entity, instead of comparing angles and distances within 
the neighborhood with those within its embedding. Second, the Procrustes statistic is not 
highly sensitive to small perturbations of the embedding. More formally, G(X,Y) = 0{e 2 ), 
where 7 = X + eZ and Z is a general matrix (see Sib son|[l979| l. Finally, the function G is 
h -norm-based and therefore prefers small differences at many points to big differences at 
fewer points. This is preferable in our context, as the local embedding of the neighborhood 
should be compatible with the embeddings of nearby neighborhoods. 

The usage of R as a measure of the quality of the global embedding of the manifold is 
justified by Theorem[T] Theorem[T]claims that when the number of input points X increases, 
the low-dimensional points Z = <j> (X) of the input data tend to zero R. This implies that the 
minimizer 7 of R should be close to the original data set Z (up to rotation and translation). 
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Theorem 1 Let S be a compact connected set. Let (j) : 3> — > R 9 be an isometry. LetX^ be 
an n-point sample taken from (j>(2$), and let Z'"' = (X^). Assume that the sample 
is dense with respect to the choice of neighborhoods for all n > Nq. Then for all n > Nq 

R(xM,2p>)=ff(£ ax ). (5) 

See Appendix | A. 1 1 for proof. 

Replacing R(X,Y) with the normalized version R^(X,Y) (see Eq. [4| and noting that 

\\HXi\\ 2 F = ff(rf),we obtain 

Corollary 1 

R N (xW,zW) = i?(r 2 max ). 

To avoid heavy computations, a slightly different version of R(X,Y) can be considered. 
Instead of measuring the difference between the original neighborhoods on the manifold and 
their embeddings, one can compare the local PCA projections ( Mar dmet al| 1 1 979} of the 
original neighborhoods with their embeddings. We therefore define 

R P CA(X,Y) = -f j G(X i P i ,Y i ), (6) 

where Pi is the J-dimensional PCA projection matrix of X,. 

The convergence theorem for Rpca is similar to Theorem [TJ but the convergence is 
slower. 

Theorem 2 Let & be a compact connected set. Let (j) : 2# — > W* be an isometry. LetX^ be 
an n-point sample taken from (j) {&), and let = {X^). Assume that the sample X" 
is dense with respect to the choice of neighborhoods for all n > Nq. Then for all n > No 

R PCA (x("\z^)=&(ri ax ), 

See Appendix | A.2| for proof. 

We now present another version of the Procrustes measure Rc(X,Y), suitable for con- 
formal mappings. Here we want to compare between each original neighborhood X; and its 
corresponding embedding Y,, where we allow Y { not only to be rotated and translated but 
also to be rescaled. Define 

G C (XJ)= inf tr ( (X- Y(cA'))' H (X -Y(cA h 

V ' {A:A'A=Ifl<ceR\ V V V " V ' 

Note that the scalar c was introduced here to allow scaling of Y. Let Z = X'HY and let 
ULV' be the svd of Z. The minimizer rotation matrix A is given by UV' . The minimizer con- 
stant c is given by tr (L) /tr(F'y)(see Sibson, 19781. The (normalized) conformal Procrustes 
measure is given by 

R C (XJ) = X j^G c {X i ,Y i )/\\HX i \\ 2 F . (7) 

Note that Rq {X, Y) <R^(X,Y) since the constraints are relaxed in the definition of Rq {X, Y). 
However, the lower bound in both cases is equal (see Lemma[T]i. 
We present a convergence theorem, similar to that of R and Rpca- 
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Theorem 3 Let 3> be a compact connected set. Let <p : @ — > R 9 be a conformal mapping. 
Let be an n-point sample taken from ^(S 1 ), and let Z'"' = _I (Z"). Assume that the 
sample is dense with respect to the choice of neighborhoods for all n > iVo. Then for all 
n>Nowe have 

RrU {n \Z^\=mrl, 



max ! 



See Appendix A. 3 1 for proof. Note that this result is of the same convergence rate as of R^ 
(see Corollary Tb. 

A cost function somewhat similar to the measure RpcA was presented by Zh ang arid| 
|Zha| ( |2004) l in a slightly different context. The local PC A projection of neighborhoods was 
used as an approximation of the tangent plane at each point. The resulting algorithm, local 
tangent subspaces alignment (LTSA), is based on minimizing the function 



£||(/-7yf)ffF; 
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where the k(i) x d matrices P; are as in Rpca- The minimization performed by LTSA is 
under a normalization constraint. This means that as a measure, LTS A's objective function is 
designed to compare normalized outputs Y (otherwise Y = would be the trivial minimizer) 
and is therefore unsuitable as a measure. 

Another algorithm worth mentioning here is SDE (Weinber ger and Saul| |2006] >. The 
constraints on the output required by this algorithm are that all the distances and angles 
within each neighborhood be preserved. Therefore the output of this algorithm should al- 
ways be close to the minimum of R(X,Y). The maximization of the objective function of 
this algorithm 

Elh-^ll 2 

u 

is reasonable when the aforementioned constraints are enforced. However, it is not relevant 
as a measure for comparison of general outputs that do not fulfill these constraints. 

In summary, we return to the questions posed at the beginning of this section. We choose 
to define preservation of local neighborhoods as minimization of the Procrustes measure R 
(see Eq.[3]l. We therefore find a global embedding that best preserves local structure by min- 
imizing R. For computational reasons, minimization of Rpca (see Eq. [6J may be preferred. 
For conformal maps we suggest the measure Rc (see Eq.|7]i, which allows separate scaling 
of each neighborhood. 



4 Algorithms 

In Section[3]we showed that a faithful embedding should yield low R(X,Y) and Rpca(X,Y) 
values. Therefore we may attempt to find a faithful embedding by minimizing these func- 
tions. However, R(X, Y) and Rpca (X,Y) are not necessarily convex functions and may have 
more than one local minimum. In this section we present two algorithms for minimization 
of R(X,Y) or Rpca{X,Y). In addition, we present an iterative method that can improve the 
output of the two algorithms, as well as other existing algorithms. 

The first algorithm, greedy Procrustes (GP), performs the embeddings one neighborhood 
at a time. The first neighborhood is embedded using the PCA projection (see Lemma[T|l. At 
each stage, the following neighborhood is embedded by finding the best embedding with 
respect to the embeddings already found. This greedy algorithm is presented in Section pTTj 
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The second algorithm, Procrustes subspaces alignment (PSA), is based on an align- 
ment of the local PCA projection subspaces. The use of local PCA produces a good low- 
dimensional description of the neighborhoods, but the description of each of the neighbor- 
hoods is in an arbitrary coordinate system. PSA performs the global embedding by finding 
the local embeddings and then aligning them. PSA is described in Section |4~2| Simulated 
annealing (SA) is used to find the alignment of the subspaces (see Section |4.3[ >. 

After an initial solution is found using either GP or PSA, the solution can be improved 
using an iterative procedure until there is no improvement (see Section[4~4|. 



4.1 Greedy Procrustes (GP) 

GP finds the neighborhoods' embeddings one by one. The flow of the algorithm is described 
below. 



1. Initialization: 

- Find the neighbors X, of each point x-, and let Neighbors (i) be the indices of the 
neighbors Xj. 

- Initialize the list of all embedded points' indices to TV := 0. 

2. Embedding of the first neighborhood: 

- Choose an index i (randomly). 

- Calculate the embedding Y; = XjPi, where P, is the PCA projection matrix of X,. 

- Update the list of indices of embedded points N = Neighbors(i). 

3. Find all other embeddings (iteratively): 

- Find /, where Xj is the unembedded point with the largest number of embedded 
neighbors, 

; = argmax^ |{Neighbors(p) n N}\. 

- Define Xj = {x p \p 6 Neighbors^) HiV}, the points in Xj that are already embed- 
ded. 

Define Yj as the (previously calculated) embedding of Xj. 

Define Xj = {x p \p 6 Neighbors^) \ Af}, the points inXj that are not embedded yet. 

- Compute the Procrustes rotation matrix Aj and translation vector bj between Xj and 

- Define the embedding of the points in Xj as Yj = XjAj + bj. 

- Update the list of indices of embedded points N 
N := 2VU Neighbors (/). 

4. Stopping condition: 

Stop when \N\ = n, i.e., when all points are embedded. 



The main advantage of GP is that it is fast. The embedding forX,- is computed in &(k(i) 3 ) 
where k(i) is the size of X,. Therefore, the overall computation time is ff(nK 3 ), where K = 
max, &(i). While GP does not claim to find the global minimum, it does find an embedding 
that preserves the local neighborhood's structure. The main disadvantage of GP is that it has 
incremental errors. 
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4.2 Procrustes Subspaces Alignment (PSA) 

R(X,Y) can be written in terms of the Procrustes rotation matrices A, as 

R(X,Y) = -f t inf 11/7(^-7^)^, 

where // is the centering matrix. A, can be calculated, given X and Y. However, as Y is not 
given, one way to find Y is by first guessing the matrices A,- and then finding the Y that 
minimizes 

R(X,Y\A u ...,An) = -£ j \\H(X i -Y i A' i )\\ 2 F . (8) 

Y can be found by taking derivatives of R(X,Y\A\, . . . ,A„). 

We therefore need to choose A,- wisely. The choice of the Jacobian matrices 7, = J<f, (z, ) 
as a guess for A,- is justified by the following, as shown in the proof of Theorem [T] (see 
Eq. 14 1. As the size of the sample is increased, - £" =i \\H(Xj — Z;/-) || — ► 0. This means that 
choosing J, will lead to a solution Y that is close to the minimizer of R(X,Y). 

The PCA projection matrices P; approximate the unknown Jacobian matrices up to a 
rotation. To use them in place of we must first align the projections correctly. Therefore, 
our guess for A,- is of the form A,- = P,0,, where O,- are d x d rotation matrices that minimize 

/(A 1 ,...,A„) = f £ \\ A >- A j\f F - (9) 

i'=l 7'eNeighbors(i) 



The rotation matrices O, can be found using simulated annealing, as described in Section pP] 
Once the matrices A, are found, we need to minimize R(X,Y\Ai , . . . ,A„). We first write 
R(X,Y\A\, . , . ,A„) in matrix notation as 

R(X,Y\A U . . . ,A„) = - £ tr ((X - YA^'H^X - YAty . 

Here //, is the centering matrix of neighborhood X; 

1 — F^T) ^ = ' an d£ ^ Neighbors (i) 
Ht(k,l) = ^ -™ it / /andfc,/ e Neighbors(i') 
elsewhere . 

The rows of the matrix H{X at Neighbors(z') indices are x;'s centered neighborhood, 
where all the other rows equal zero, and similarly for H\Y . . 



Taking the derivative of R{X,Y\A\ ,.. . ,A„) (Eq. with respect to Y (see Mardia et al 



1979 1 and using the fact that A\A-, = I, we obtain 



jr-R(X,Y\A lt . . . ,A„ = £ tf/XA,- - - V HiY . (10) 
dY n ~ n r-i 



Using the general inverse of Y!i=\ Hi we can write Y as 



Y =\L H n L R i XA i- 



Summarizing, we present the PSA algorithm. 
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1. Initialization: 

- Find the neighbors X; of each point x,. 

- Find the PC A projection matrices Pi of the neighborhood Xj. 

2. Alignment of the projection matrices: 

Find A, that minimize Eq. fusing, for example, simulated annealing (see Section [43| . 

3. Find the embedding: 
Compute Y according to Eq.[TT| 

The advantage of this algorithm is that it is global. The computation time of this al- 
gorithm (assuming that the matrices A; are already known) depends on multiplying by the 
inverse of the sparse symmetric semi-positive definite matrix £" =1 //,-, which can be very 
costly. However, this matrix need not be computed explicitly. Instead, one may solve d- 
linear-equation systems of the form (XH=\H-i) x = b, which can be computed much faster 
(see, for example, Munksgaard 1980}. 



4.3 Simulated Annealing (SA) alignment procedure 



In step 2. of PSA (see Section |4.2| i, it is necessary to align the PCA projection matri- 
ces Pj. In the following we suggest an alignment method based on simulated annealing 
(SA) ( Kirkpatrick et al , 1983 ). The aim of the suggested algorithm is to find a set of columns- 
orthonormal matrices Ai,...,A„ that minimize Eq. [9] A number of closely-related algo- 
rithms, designed to find embeddings using alignment of some local dimensionally-reduced 
descriptions, were previously suggested. Roweis et al (2 001| > and |Verbeek et al (2002) intro- 
duced algorithms based on probabilistic mixtures of local FA and PCA structures, respec- 
tively. As Eq. [9] and these two algorithms suffer from local minima, the use of simulated 
annealing may be beneficial. Another algorithm, suggested by Teh and Roweis (2003), uses 
a convex objective function to find the alignment. The output matrices of this algorithm are 
not necessarily columns-orthonormal, as is required in our case . 

Minimizing Eq.[9]is similar to the Ising model problem (see, for example, Cipra, 1987| >. 
The Ising model consists of a neighbor-graph and a configuration space that is the set of all 
possible assignments of + 1 or — 1 to each vertex of the graph. A low-energy state is one 
in which neighboring points have the same sign. Our problem consists of a neighbor-graph 
with a configuration space that includes all of the rotations of the projection matrices Pi 
at each point x t . Minimizing the function / is similar to finding a low-energy state of the 
Ising model. As solutions to the Ising model usually involve algorithms such as simulated 
annealing , we take the same path here. 

We present the SA algorithm, following the algorithm suggested by Si arry et al| ( |1997| ), 
modified for our problem. 

1. Initialization: 

- Choose an initial configuration (for example, using GP). 

- Compute the initial temperature (see Siarry et al]|1997| ). 



- Define the cooling scheme (see |Siarry et al| 1997| )T 
2. Single SA step: 

- Choose i randomly. 

Generate a random d x d rotation matrix 0, (see Stewart. [T980J. 
Define Af ew =A,0,. 

- Compute /(Ai,...,Af ew ,...,A„). 

Note that it is enough to compute ENeighbors(i) 1 | Af ew — Aj \\ F , 
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- Accept Aj if either 

f(A 1 ,...,Af ew ,...,A n )<f(A u ...,A i ,...,A n ) 

or with some probability depending on the current temperature. 

- Decrease the temperature and stop if the lowest temperature is reached. 
3. Outer iterations: 

- First iteration: Perform a run of SA on all matrices A\ , . . . ,A„. 

Find the largest cluster of aligned matrices (for example, use BFS ( [Corman et al| 
|1990| > and define an alignment criterion). 

- Other iterations: Apply SA to the matrices that are not in the largest cluster. Update 
the largest cluster after each run. 

- Repeat until the size of the largest cluster includes almost all of the matrices A,-. 

Using SA is complicated. The cooling scheme requires the estimation of many parame- 
ters, and the run-time depends heavily on the correct choice of these parameters. For output 
of large dimension, the alignment is difficult, and the output of SA can be poor. Although SA 
is a time-consuming algorithm, each iteration is very simple, involving only 0(Kqd 3 ) oper- 
ations, where K is the maximum number of neighbors, and q and d are the input and output 
dimensions, respectively. In addition, the memory requirements are modest. Therefore, SA 
can run even when the number of points is large. 



4.4 Iterative procedure 

Given a solution of GP, PSA, or any other technique, it is usually possible to modify Y so 
that R(X,Y) is further decreased. The idea of the iterative procedure we present here was 
suggested independently by Zhang and Zha ( 2004), but no details were supplied. 

In Section |4~2| we showed that given Y, the improved matrices A, are obtained by finding 
the Procrustes matrices between Y { and X,-. Given the matrices A,-, the embedding Y can be 
found using Eq. [TT] An iterative procedure would require finding first the new matrices A,- 
and then a new embedding Y at each stage. This would be repeated until the change in the 
value of R(X, Y) was small. 

The problem with this scheme is that it involves the computation of the inverse of the 



matrix Y!l=\Hi (see end of Section 4.2 1. We therefore suggest a modified version of this 



iterative procedure, which is easier to compute. Recall that 

R(X,Y) = £ E \\xj-Aaj-bi 

i=l ;'GNeighbors(i) 

The least-squares solution for £>,- is 

1 



The least-squares solution for yj is 

1 



—-z- V AUxj-bi). (13) 
j e Neighbors (j)} r . . ri. v 

L J b WJ {jyeNeighborsfi)} 

Note that we get a different solution than that in Eq.[l0] The reason is that here we consider 
bi as constants when we look for a solution for yj. In fact, yj appear in the definition of 
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the bj. However, as yj appear there multiplied by l/k(i), these terms make only a small 
contribution. 

We suggest performing the iterations as follows. First, find the Procrustes rotation ma- 
trices A, and the translation vectors using Eq. 12 Then find yj using Eq. 13 Repeat until 
R(X,Y) no longer decreases significantly. 



5 Numerical Examples 

In this section we evaluate the algorithms GP and PSA on data sets that we assumed to be 
sampled from underlying manifolds. We compare the results to those obtained by LLE fRowq' is 
|and Saul[|2000| l, Isomap ( |Tenenbaum etalj [2000) 1, and LTSA ( |Zhang and Zha] |2"0"04| >, both 
visually and using the measures R N (X,Y) and R C (X,Y) (see Table|2]. 

The algorithms GP and PSA were implemented in the Matlab environment, running on 
a Pentium 4 with a 3 Ghz CPU and 0.98 GB of RAM. The alignment stage of PSA was 
implemented using SA (see Section |4~3]>. The runs of both GP and PSA were followed by 



the iterative procedure described in Section 4.4 to improve the minimization of R(X,Y) 



LLE, Isomap, and LTSA were evaluated using the Matlab code taken from the sites of the 
respective authors. The algorithm SDE (Weinberger and Saul, 2006), whose minimization 
is closest in spirit to ours, was also tested; however, it suffers from heavy computational 
demands, and the results of this algorithm could not be obtained using the code provided in 
the site. 

The data sets are described in Table[T] We ran all five algorithms using k = 6,9, 12, 15 
and 18 nearest neighbors. The minimum values for Rn(X,Y) and Rc{X,Y) are presented 
in Table [2] The results in all cases were qualitatively the same, therefore in the images we 
show the results for k = 12 only. 



Name 


n 


q 


d 


Description 


Figure 


Swissroll 


1600 


3 


2 


isometrically embedded in R 3 


Fig 1 


Hemisphere 


2500 


3 


2 


not isometrically embedded 
in R 3 


Kg0 


Cylinder 


800 


3 


2 


locally isometric to R 1 , 
has no global embedding in R 2 


Figg 


Faces 


1965 


560 


3 


20 x 28 pixel grayscale face images 
(see|Roweis||retrieved Nov. 2006} 


Figg 


Twos 


638 


256 


10 


images of handwritten "2"s 
from the USPS data set 
of handwritten digits ( Hull 1994 


None, due 
to output 
dimension 



Table 1 Description of five data sets used to compare the different algorithms, n is the sample size and q and 
d are the input and output dimensions, respectively. 



Overall, GP and PSA perform satisfactorily as shown both in the figures and in Table[2] 
The fact that in most of the examples GP and PSA get lower values than LLE, Isomap, 
and LTSA is perhaps not surprising, as GP and PSA are designed to minimize R(X,Y). The 
run-times of the algorithms excluding PSA is on the order of seconds while it takes PSA a 
few hours to run. Memory requirements of GP and PSA are lower than those of the other 
algorithms. As a consequence of the memory requirements, results could not be obtained 
for LLE, Isomap and LTSA for n > 3000. 
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(A) 





(C) 



(E) 



MM 




Fig. 1 A 1600-point sample taken from the three-dimensional Swissroll input is presented in (A). (B)-(F) 
show the output of GP, PSA, LLE, Isomap, and LTSA, respectively, for k = 12. Both GP and PSA, like 
Isomap, succeed in finding the proportions of the original data. 



Use of the measure R(X, Y) allows a quantitative comparison of visually similar outputs. 
Regarding the output of the cylinder (see Fig. [3}, for example, PSA and Isomap both give 
topologically sound results; however, R(X,Y) shows that locally, PSA does a better job. In 
addition, R(X,Y) can be used to optimize embedding parameters such as neighborhood size 
(see Fig|5]l. 





Swissroll 


Hemisphere 


Cylinder 


Faces 


Twos 


GP 


0.00 [0.00] 


0.02 [0.01] 


0.13 [0.01] 


0.45 [0.36] 


0.00 [0.00] 


PSA 


0.00 [0.00] 


0.03 [0.01] 


0.02 [0.01] 


0.35 [0.30 


0.00 [0.00] 


LLE 


0.81 [0.23] 


0.60 [0.00] 


0.73 [0.13] 


0.99 [0.79] 


0.82 [0.23] 


Isomap 


0.01 [0.01] 


0.03 [0.02] 


0.34 [0.25] 


0.5 [0.38] 


0.02 [0.01] 


LTSA 


0.99 [0.22] 


0.93 [0.04] 


0.59 [0.48] 


0.99 [0.53] 


0.98 [0.37] 


Lower Bound 


0.00 


0.00 


0.00 


0.11 


0.00 



Table 2 Comparison of the output of the different algorithms using Rn(X,Y) [Rc(X,Y)] as the measures of 
the quality of the embeddings. These values are the minima of both measures as a function of neighborhood 
size k, for k = 6,9, 12, 15, 18. The lower bound was computed using local PCA at each neighborhood (see 
Lemma[TJ. 
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Fig. 2 The input of a 2500-point sample taken from a hemisphere is presented in (A). (B)-(F) show the output 
of GP, PSA, LLE, Isomap, and LTSA, respectively, for k = 12. Both GP and PSA, like the other algorithms, 
perform the embedding, although the assumption of isometry does not hold for the hemisphere. 



6 Discussion 

In this section, we emphasize the main results of this work and indicate possible directions 
for future research. 

We demonstrated that overall, the measure R(X,Y) provides a good estimation of the 
quality of the embedding. It allows a quantitative comparison of the outputs of various 
embedding algorithms. Moreover, it is quickly and easily computed. However, two points 
should be noted. 

First, R(X,Y) measures only the local quality of the embedding. As emphasized in 
Fig. [3] even outputs that do not preserve the global structure of the input may yield rela- 
tively low R- values if the local neighborhood structure is generally preserved. This problem 
is shared by all manifold-embedding techniques that try to minimize only local attributes of 
the data. The problem can be circumvented by adding a penalty for outputs that embed dis- 
tant geodesic points close to each other. Distant geodesic points can be defined, for example, 
as points at least ^-distant on the neighborhood graph, with s > 1 . 

Second, R(X,Y) is not an ideal measure of the quality of embedding for algorithms that 
normalize their output, such as LLE (Roweis and Saul 2000), Laplacian Eigenmap ( |Belkin| 



|and Niyogi] |2003] ), and LTSA (Zhang and Zha 2004). This is because normalization of 
the output distorts the structure of the local neighborhoods and therefore yields high R- 
values even if the output seems to find the underlying structure of the input. This point (see 
also discussion in |Sha and Sau l 2005, Section 2) raises the questions, which qualities are 
preserved by these algorithms and how can one quantify these qualities. However, it is clear 
that these algorithms do not perform faithful embedding in the sense defined in Section [3] 
The measure Rc(X,Y) addresses this problem to some degree, by allowing separate scaling 
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(A) 



(B) 



(C) 




(E) 



(F) 




O 




Fig. 3 The input of an 800-point sample taken from a cylinder is presented in (A). (B)-(F) show the output 
of GP, PSA, LLE, Isomap, and LTSA, respectively, for k = 12. Note that the cylinder has no embedding in 
K 2 and it is not clear what is the best embedding in this case. While PSA, Isomap, and LLE succeeded in 
finding the topological ring structure of the cylinder, only PSA and LLE succeed in preserving the width of 
the cylinder. GP and LTSA collapse the cylinder and therefore fail to find the global structure, though they 
preform well for most of the neighborhoods (see Table[2j. 



of each neighborhood (see Table [2J. One could consider an even more relaxed measure 
which allows not only rotation, translations and scaling but a general linear transformation of 
each neighborhood. However, it is not clear what exactly such measure would quantify. Two 
new embedding algorithms were introduced. We discuss some aspects of these algorithms 
below. 

PSA, in the form we suggested in this work, uses SA to align the tangent subspaces at 
all points. While PSA works reasonably well for small input sets and low output dimension 
spaces, it is not suitable for large data sets. However, the algorithm should not be rejected 
as a whole. Rather, a different or modified technique for subspaces alignment, for example 
the use of landmarks ( |de Silva and Tenenbaum| |2003) >, is required in order to make this 
algorithm truly useful. 

GP is very fast (G(n) where n is the number of sample points), can work on very large 
input sets (even 100,000 in less than an hour), and obtains good results as shown both in 
Figs. |l|4| and in Table [2] This algorithm is therefore an efficient alternative to the existing 
algorithms. It may also be used to choose optimal parameters, such as neighborhood size 
and output dimension, before other algorithms are applied. R(X,Y) can be used for the 
comparison of GP outputs for varied parameters. 

An important issue that was not considered in depth in this paper is that of noisy in- 
put data. The main problem with noisy data is that, locally, the data seems g-dimensional, 
even if the manifold is ^-dimensional, d < q. To overcome this problem, one should choose 
neighborhoods that are large relative to the magnitude of the noise, but not too large with 
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Fig. 4 The projection of the three-dimensional output, as computed by PSA, on the first two coordinates 
(small points). The input used was a 1965-point sample of grayscale images of faces (see Table^. The boxes 
connected by lines are nearby points in the output set. The images are the corresponding face images from 
the input, in the same order. We see that nearby images in the input space correspond to nearby points in the 
output space. 



(A) (B) (C) 




(D) (E) (F) 




Fig. 5 The input of an 800-point sample taken from a cylinder is presented in (A). (B)-(F) show 
the output of LLE for k — 6,9,12,15 and 18, respectively. The respective values of Rc(X,Y) are 
0.25, 0. 15, 0. 13, 0. 19, 0. 17. While qualitatively the results are similar, R C (X, Y) indicates that k = 12 is opti- 
mal. 
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respect to the curvature of the manifold. If the neighborhood size is chosen wisely, both 
PSA and GP should overcome the noise and perform the embedding well (see Fig.[6]l. This 
is due to the fact that both algorithms are based on Procrustes analysis and PCA, which are 
relatively robust against noise. Further study is required to define a method for choosing the 
optimal neighborhood size. 




A Proofs 

A. 1 Proof of TheoremQ] 

In this section, we denote the points of neighborhood X, as .... ,x, ^ , where k(i) is the number of neighbors 
inX,. 



Proof In order to prove that R(X,Z) = ^ G(Z,.Z,) is ^(r^^), it is enough to show that for each ( 6 
1, . . . ,«, G(Xj,Zi) = ff(rf), where r, is the radius of the i-th neighborhood. The proof consists of replacing 
the Procrustes rotation matrix A, by /,- = A (z,), the Jacobian of the mapping at Zi- Note that the fact that (j> 
is an isometry ensures that Jjjj = I. The Procrustes translation vector b, is replaced by jc, — /;£;. 
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Recall that by definition xj —Xi = $(z.j) — 0(z,); therefore Xj —Xi = Ji{zi — Zj) + \\\zj — Z/ 1| 2 ) ■ Hence, 

Hi) .. ||2 
G(X,,Zi) = inf V \\x, -Mu - b,\\ (14) 

m I, || 2 

*(0 || || 2 



E< 

7=1 



is an isometry, therefore d^r(xipXf) = Viztj — Zi ■ where d^f is the geodesic metric. The sample is 

assumed to be dense, hence |^. — x,-|| < ,vo, where *o is me minimum branch separation (see Section^. 
Using Bernstein et al (2000. Lemma 3) we conclude that 



I , / s X II I 
ij -Zi\\ =djn{xi p Xi) < - \\xij -x t \ 



We can therefore write I ||z,-. — z;|| J = 0{rf), which completes the proof. 



A.2 Proof of Theorem[2] 

The proof of Theorem[2]is based on the idea that the PCA projection matrix P, is usually a good approximation 
of the span of the Jacobian . The structure of the proof is as follows. First we quantify the relations between 
XjP{ and XjJj, the projections of the i-th neighborhood using the PCA projection matrix and the Jacobian, 
respectively. Then we follow the lines of the proof of Theorem[T] using the bounds obtained previously. 

To compare the PCA projection subspace and tangent subspace at Xj we use the notation of angle between 
subspaces. Note that both subspaces are rf-dimensional and are embedded in the Euclidian space W 1 . The 
columns of the matrices P, and 7, consist of orthonormal bases of the PCA projection space and of the tangent 
space, respectively. Denote these subspaces by S*i and respectively. Surprisingly, the angle between 
S?i and J*i can be arbitrarily large. However, in the following we show that even if the angle between the 
subspaces is large, the projected neighborhoods are close. 

We start with some definitions. The principal angles o~i , . . . , Cy between and may be defined 
recursively for p = 1 , . . . , d as (see Golub and Loan 1 9 83) 

cos(o~„) = max max v w, 

subject to 

||v|| = ||w|| = 1, v'v* = 0,w'w k = ;k= l,...,p- 1. 

The vectors V[,...,vj and wi,...,wj are called principal vectors. 

The fact that P, and 7, have orthogonal columns leads t o a simple way to calcul ate the principal vectors 
and angles explicitly. Let ULV' be the svd of j'.Pi. Then (see |Golub and Loan||1983) 

1. vi,...,v,j are given by the columns of PjV. 

2. w [,..., are given by the columns of J,U. 

The relations between the two sets of vectors plays an important role in our computations. Write w„ = 
a„v„ + p„Vp, where a„ = w' v j}„ = \\w p — OtpV p \\ and vh = K ''' ,J i ■ Note that a„ is the cosine of the 

p-th principal angle between and Jfj. The angle between the subspaces is defined as arccos(cerf) and the 
distance between the two subspaces is defined to be f,in(aj). 

We now prove some basic claims related to the principal vectors. 

Lemma 2 Let Pj be the projection matrix of the neighborhood X, and let /, be the Jacobian of <j> at z,. Let 
ULV' be the svd ofJjPj and vi, . . . , \'j and w\, . . . ,wj be the columns of PjV and J/U, respectively. Then 
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1. vi , . . . , Vrf are an orthonormal vector system. 

2. w [,..., wj are an orthonormal vector system. 

3. v p ±w q for p=i q. 
4- Xv^forp^q. 

5. vj- ±v C jforq=\,...,d. 

Proof 

1. True, since PjV is an orthonormal matrix. 

2. True, since 7, J7 is an orthonormal matrix. 

3. Note that (JjU)' (P/V) = U'(J'jPj)V = L where L is a diagonal non-negative matrix. 
4. 

(A»Vp)'0V«) = ( w p - <V' P )'K - a P v q ) 

= w' p w q -v p 'w q - v' q w p + Vp'v, = . 



(/Vp)' v « = ( w l>- a P v p)' v q 



Using the relation between the principal vectors, we can compare the description of the neighborhood Xj 
in the local PCA coordinations and its description in the tangent space coordinations. Here we need to exploit 
two main facts. The first fact is that the local PCA projection of a neighborhood is the best approximation, in 
the ?2 sense, to the original neighborhood. Specifically, it is a better approximation than the tangent space in 
the h sense. The second is that in a small neighborhood of Xj, the tangent space itself is a good approximation 
to the original neighborhood. Formally, the first assertion means that 

*(f) ,, ||2 *(') n M 2 *(0 n 1,2 

I ^ik'K--*'-) >Lp',(x.i-m\ (i5) 

;=i" 1 j=i" 1 7=1 11 11 

while the second assertion means that 

*(0 ,, l|2 *(0 1 1 l|2 

I -I p?K-*) = ff i r t)- ( 16) 

7=1 7=1 
The proof of Eq.[l6]is straightforward. First note that 

(xij - Xj) = (xi. -Xi)- (xi-Xi) 

= Mzij - n) -Jiiii - Zi) + 0{r\) 



Hence 



,2 I, ii2 1 

(x,. -x,-) - \]\(x ij -xM = £ [w'p(x il -X/)) 2 
" " " " p=d+l 

II - - II 2 

= (Xij -Xi)~ JiJ'i {Xij - Xj ) 

= jjiizij - ii) - JiA (Ji(z h - Zi)) + 0{r}) || 2 
= H^)||W(r?), 

where wj^- 1 , . . . , w q are a completion of wi , . . . , wj to an orthonormal basis of W and we used the fact that 
J',J,=I- 

The following is a lemma regarding the relations between the PCA projection matrix and the Jacobian 
projection. It is a consequence of Eq.|15| 
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Lemma 3 

3. = <?(>i). 

4. (x !; -.v-,)'v^ = ^(r2). 

Proof 



1. and 2. follow from Eqs.|15|and |16| 

3. follows from the definition of r, . 

4. is a consequence of 1., indeed, 

«■('') d , 2 k ® II ||2 *(''' ii ||2 

E E ( v p'K -*)) < E N -*-) - E k'tffr, 

/=lp=l N 7=1" 11 y=l 11 11 

= 

We now prove TheoremOl Similarly to the proof of TheoremfT] it is enough to show that G(-X,P,,Z,) = 



m I, | |2 

G(X,T> u Zi) = inf V -A,z, -bi\\ 

< £ /?(^-ii)-o,-(z, y -a) 

7=1 

*(-■) ,. ||2 
= E -xd-OiJ'iixij -xfi + OiJKxij -xfi-Oiizij-zM 

7=1 

< £ -xi) - ojixij -m\ + E Woi^ -*.■) - 0,-(zo -z/) 

7=1 11 7=1 11 

= Exp 1 + Exp2 . 

where O, is some d x d orthogonal matrix. Note that due to its orthogonality, Exp2 is independent of the 
specific choice of O, . 

We choose O, = VU'. Rewriting Expl we obtain 

Expl = Y.\\ti(xii -Xi)-OiJl(xij -Xi)\\ 
7=1 

*W ,. ||2 
= 'L\\fi(x ij -x i )-VU'j' i (x ij -x i )\\ 

7=1 11 11 

*W ,. ||2 

= E r'^'K -*■) - (y'vw'juxij -i,o 

7=1 11 
*(') rf 2 

= E E (V( i ij" jt i)- w p(^-' ji )) ■ 

7=lp=l 
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This last expression brings out the difference between the description of the neighborhood Xj in the local 
PCA coordinations and its description in the tangent space coordinates. Using Lemma|2] we can write 

Hi) d 

Expl = E E ((v p - w p )'( Xij -Xi)) 2 (17) 

7=1 p=l 

m d 

= E E vp - u p v p - Pp v p)'( x h -*)) 
m d 

= EK 1 -« P )(VK-*)) 

m d 

- £ £ 2(1- a p )P p (v p '( Xlj -Xi)) (vfty -xt)) 

7=1 p=l 

m d 

+ EE^«'(^-*)) 2 . 

7=1 p=l 

We will use Lemma[3]to bound the first expression of the RHS. 

= E K'K- -*■) - E F 7 <'K 

7=1 11 11 7=1 11 11 

fc(") d 

= E E { -*)) 2 - ((«7.vp+^)'( Xi , -.f,)) 2 } 

7=lp=l 
7=lp=l 

m d 

+2 E E a pP P ( V p'(Xij (Vp'fe; 
7=lp=l 

*(/) rf 
7=1 p=l 

Note also that (1 - a p ) 2 < 1 - aj. Hence, 

*(') d k(i) d 

E E (i-«p) 2 (vp'(^.-%)) 2 < e E#(^fe;-*)) 

7=lp=l 7=1P=1 

*('") <i 

-2 e E «pMV(*.; -*>)) (vp'fe, 

7=1 p=l 

+0(4). 

Plugging it into Eq. |17| we get 

*(0 d 

Expi < EE 2 ^ 2 WK -*.■)) 

7=1 p=l 
k(i) d 

-EE 2ft>(vp'(x ; , -*)) -*)) + 

7=1 p=l 

where the last inequality is due to Lemma[3] 
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Proving that Exp2 is €>"(rf) is straightforward. 

*(;)., ||2 
Exp2 = £ mfiixij -xi) - Oiizij-m 

= E|k(*(,-*-)-(^-&)f=^('f) I 

/=l" 

which concludes the proof of Theorem|2] 



A.3 Proof of Theorem[3] 

The proof is similar to the proof of Theorem [T] (see Section[AjTJ. The proof consists of replacing the Pro- 
crustes rotation matrix A; and the constant c, by = Ji [zi), the Jacobian of the mapping at z,. Note that as 
is an conformal mapping which ensures that J\Jj = cl. The Procrustes translation vector fo, is replaced by 
Xi-JiZi. 

Recall that by definition xj —Xj = <j>(zj) — 0(z,); therefore Xj —Xi = Ji[zi — Zj) + & f ||zy — Z;|| j ■ Hence, 
G(X,,Z,)= inf T Xi ,-CiAiZu-biW (18) 

Aj.bj.Cj II J J II 

*(<) I, || 2 

j=i" 

As is an conformal mapping, we have c mm ||z^ — z;|| < d ^(xj^Xj), where djg is the geodesic metric 

and c m j n > is the minimum of the scale function c(z) measures the scaling change of at z . The minimum 
c m j n is attained as ® is compact. The last inequality holds true since the geodesic distance d j( (.v, . , x,- ) equals 
to the integral over c(z) for some path between z;. and z;. 

The sample is assumed to be dense, he nce — je^ 1 1 < so, where .so is the minimum branch separation 
(see Section[2). Using again Bernstein et al (2000. Lemma 3) we conclude that 

I II 1 , , % ft II II 
\\Zij -Zi < djflXij.x,) < — — \\xij Xj . 

II II Cmin ^C min II II 

We can therefore write (f ^||z^ — Zi|| ^ — @{ r f)- Dividing by the normalization for each neighbor- 

hood we obtain Rc(X,Y) = (r^. dx ) which completes the proof. 
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